log-linear model
On the Accuracy of Self-Normalized Log-Linear Models
Jacob Andreas, Maxim Rabinovich, Michael I. Jordan, Dan Klein
Calculation of the log-normalizer is a major computational obstacle in applications of log-linear models with large output spaces. The problem of fast normalizer computation has therefore attracted significant attention in the theoretical and applied machine learning literature. In this paper, we analyze a recently proposed technique known as "self-normalization", which introduces a regularization term in training to penalize log normalizers for deviating from zero. This makes it possible to use unnormalized model scores as approximate probabilities. Empirical evidence suggests that self-normalization is extremely effective, but a theoretical understanding of why it should work, and how generally it can be applied, is largely lacking. We prove upper bounds on the loss in accuracy due to self-normalization, describe classes of input distributions that self-normalize easily, and construct explicit examples of high-variance input distributions. Our theoretical results make predictions about the difficulty of fitting self-normalized models to several classes of distributions, and we conclude with empirical validation of these predictions.
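To make the self-normalization penalty concrete, here is a minimal sketch of the kind of objective the abstract describes: a softmax log-linear classifier whose negative log-likelihood is augmented with a term penalizing the squared log normalizer, so that raw scores can later be read off as approximate log-probabilities. The data, penalty weight `alpha`, and step size are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Minimal sketch of self-normalized log-linear training (illustrative, not the
# authors' code). Model: p(y|x) ∝ exp(w_y · x); the regularizer pushes
# log Z(x) toward zero so raw scores can be read as log-probabilities.

rng = np.random.default_rng(0)
n, d, k = 500, 20, 30          # examples, input dim, output classes
X = rng.normal(size=(n, d))
y = rng.integers(k, size=n)
W = np.zeros((k, d))
alpha, lr = 0.1, 0.05          # penalty weight and step size (assumed values)

for _ in range(200):
    S = X @ W.T                              # raw scores, shape (n, k)
    logZ = np.logaddexp.reduce(S, axis=1)    # log normalizer per example
    P = np.exp(S - logZ[:, None])            # normalized probabilities
    # Gradient of the NLL term: (P - onehot(y))ᵀ X
    G = P.copy()
    G[np.arange(n), y] -= 1.0
    grad = G.T @ X / n
    # Gradient of alpha * mean(logZ²); d logZ / dW flows through the softmax P
    grad += (2 * alpha * logZ[:, None] * P).T @ X / n
    W -= lr * grad

# At test time, skip normalization: use S = X @ W.T directly as log p(y|x).
```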
A Complete Decomposition of KL Error using Refined Information and Mode Interaction Selection
James Enouen, Mahito Sugiyama
The log-linear model has received significant theoretical attention over previous decades and remains the fundamental tool for learning probability distributions over discrete variables. Despite its popularity in statistical mechanics and high-dimensional statistics, the vast majority of energy-based modeling approaches focus only on two-variable relationships, as in Boltzmann machines and Markov graphical models. Although these approaches have easier-to-solve structure learning problems and easier-to-optimize parametric distributions, they often ignore the rich structure that exists in the higher-order interactions between variables. Using more recent tools from the field of information geometry, we revisit the classical formulation of the log-linear model with a focus on higher-order mode interactions, going beyond the 1-body modes of independent distributions and the 2-body modes of Boltzmann distributions. This perspective allows us to define a complete decomposition of the KL error, which in turn motivates a sparse selection problem over the set of possible mode interactions. Just as sparse graph selection allows for better generalization, we find that our learned distributions make more efficient use of the finite data available in practice. On both synthetic and real-world datasets, we demonstrate our algorithm's effectiveness in maximizing the log-likelihood for the generative task and its ease of adaptation to the discriminative task of classification.
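The "mode interaction" picture is easy to ground in a toy example: a log-linear model over binary variables whose energy sums parameters over subsets of variables, with 1-body terms giving independent distributions, 2-body terms giving Boltzmann-style pairwise models, and higher-order terms capturing structure invisible to pairwise models. The sketch below enumerates a small state space explicitly; all parameter values are made up for illustration.

```python
import itertools
import numpy as np

# Toy log-linear model over d binary variables with interaction terms up to
# order 3 (the "1-body / 2-body / higher-order mode" picture from the
# abstract). All parameter values here are illustrative, not from the paper.

d = 4
states = np.array(list(itertools.product([0, 1], repeat=d)))  # all 2^d states
modes = [S for r in (1, 2, 3) for S in itertools.combinations(range(d), r)]
theta = {S: 0.0 for S in modes}
theta[(0,)] = 0.5          # a 1-body mode (independent bias)
theta[(0, 1)] = -1.0       # a 2-body mode (Boltzmann-style pairwise term)
theta[(0, 1, 2)] = 2.0     # a 3-body mode, invisible to pairwise models

# Energy of each state: sum over modes of theta_S * prod_{i in S} x_i
energy = np.zeros(len(states))
for S, t in theta.items():
    energy += t * states[:, S].prod(axis=1)

logZ = np.logaddexp.reduce(energy)
p = np.exp(energy - logZ)   # normalized distribution over all 2^d states
print(p.round(3))
```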
Reviews: Trimmed Density Ratio Estimation
Summary: This paper proposes a "trimmed" estimator that robustly (to outliers) estimates the ratio of two densities, assuming an exponential family model. This robustness is important, as density ratios can inherently be very unstable when the denominator is small. The proposed model is based on an optimization problem, motivated by minimizing KL divergence between the two densities in the ratio, and is made more computationally tractable by re-expressing it in terms of an equivalent saddle-point/max-min formulation. Similar to the one-class SVM, this formulation explicitly discards a portion (determined by a tuning parameter) of "outlier" samples. The density-ratio estimator is shown to be consistent in two practical settings: one in which the data contains a small portion of explicit outliers, and another in which the estimand is intrinsically unstable.
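As a loose sketch of the trimming idea (not the paper's saddle-point estimator), one can take a KLIEP-style exponential-family density-ratio objective and, at each step, discard the fraction of numerator samples with the largest estimated log-ratio, analogous to how the one-class SVM discards a tuning-parameter fraction of "outlier" samples. All data and hyperparameters below are illustrative assumptions.

```python
import numpy as np

# Generic "discard the most extreme fraction" pattern the summary describes,
# applied to a KLIEP-style density-ratio model r(x) = exp(theta·x). This is a
# loose sketch of the trimming idea, not the paper's saddle-point estimator.

rng = np.random.default_rng(0)
Xp = rng.normal(0.0, 1.0, size=(300, 2))        # numerator samples
Xq = rng.normal(0.2, 1.0, size=(300, 2))        # denominator samples
Xp[:15] += 8.0                                   # inject explicit outliers

theta = np.zeros(2)
trim, lr = 0.05, 0.1                             # tuning parameter, step size
for _ in range(100):
    # Estimated log-ratios on numerator samples; discard the largest ones
    # (suspected outliers), keeping the (1 - trim) fraction.
    logr = Xp @ theta
    keep = logr <= np.quantile(logr, 1 - trim)
    # Objective: -mean_keep(theta·x_p) + log mean exp(theta·x_q)
    w = np.exp(Xq @ theta)
    grad = -Xp[keep].mean(axis=0) + (w[:, None] * Xq).sum(axis=0) / w.sum()
    theta -= lr * grad

print(theta)   # the injected outliers have limited influence on the fit
```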
Can Transformers Learn $n$-gram Language Models?
Anej Svete, Nadav Borenstein, Mike Zhou, Isabelle Augenstein, Ryan Cotterell
Much theoretical work has described the ability of transformers to represent formal languages. However, linking theoretical results to empirical performance is not straightforward due to the complex interplay between the architecture, the learning algorithm, and training data. To test whether theoretical lower bounds imply \emph{learnability} of formal languages, we turn to recent work relating transformers to $n$-gram language models (LMs). We study transformers' ability to learn random $n$-gram LMs of two kinds: ones with arbitrary next-symbol probabilities and ones where those are defined with shared parameters. We find that classic estimation techniques for $n$-gram LMs such as add-$\lambda$ smoothing outperform transformers on the former, while transformers perform better on the latter, outperforming methods specifically designed to learn $n$-gram LMs.
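For reference, the classic add-$\lambda$ estimator the abstract pits against transformers is tiny: counts plus a constant, normalized over the vocabulary. A minimal bigram version, with a toy corpus and an assumed $\lambda$:

```python
from collections import Counter

# Minimal add-lambda smoothing for a bigram LM (the classic n-gram estimator
# the abstract compares against transformers). Corpus and lambda are toy
# values for illustration.

corpus = "a b a c a b b c".split()
vocab = sorted(set(corpus))
lam = 0.5

bigrams = Counter(zip(corpus, corpus[1:]))
contexts = Counter(corpus[:-1])

def p(next_sym, prev_sym):
    # P(next | prev) = (count(prev, next) + lambda) / (count(prev) + lambda * |V|)
    return (bigrams[(prev_sym, next_sym)] + lam) / (contexts[prev_sym] + lam * len(vocab))

print(p("b", "a"))  # smoothed probability of "b" following "a"
```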
Pseudo-Non-Linear Data Augmentation via Energy Minimization
Pingbang Hu, Mahito Sugiyama
We propose a novel and interpretable data augmentation method based on energy-based modeling and principles from information geometry. Unlike black-box generative models, which rely on deep neural networks, our approach replaces these non-interpretable transformations with explicit, theoretically grounded ones, ensuring interpretability and strong guarantees such as energy minimization. Central to our method is the introduction of the backward projection algorithm, which reverses dimension reduction to generate new data. Empirical results demonstrate that our method achieves competitive performance with black-box generative models while offering greater transparency and interpretability.

Data augmentation has advanced significantly in recent years, primarily due to the increasing use of generative models to meet the growing demand for large datasets (Feng et al., 2021; Wong et al., 2016). Despite their success, these generative models often rely on modern deep neural networks, which are typically treated as black boxes, raising concerns about their interpretability (Guidotti et al., 2018). For instance, the popular autoencoder model encodes original data into a compact latent representation and then decodes it back, with both processes usually handled by black-box neural networks (Kingma & Welling, 2022). Consequently, even when these models perform well, the lack of understanding of the underlying transformations makes it difficult to control the generated outputs, forcing researchers to depend heavily on empirical heuristics. A natural approach to developing a more interpretable data augmentation method is to replace black-box transformations with more explicit ones (Rudin, 2019).
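The paper's backward projection is defined information-geometrically; purely as an analogy for "reversing dimension reduction to generate new data", the generic pattern looks like the PCA round-trip below, where samples are perturbed in the reduced space and mapped back. This substitutes plain PCA for the paper's method and uses made-up data.

```python
import numpy as np

# Loose analogy to "reversing dimension reduction to generate new data":
# project to a low-dimensional space, perturb there, and map back. This uses
# plain PCA, NOT the paper's information-geometric backward projection; it
# only illustrates the generic pattern.

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))             # toy dataset
Xc = X - X.mean(axis=0)

# Principal components via SVD of the centered data
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 3
Z = Xc @ Vt[:k].T                          # forward: reduce to k dims

Z_new = Z + 0.1 * rng.normal(size=Z.shape) # perturb in the latent space
X_aug = Z_new @ Vt[:k] + X.mean(axis=0)    # "backward": map back to data space
print(X_aug.shape)                         # (200, 10) augmented samples
```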
Duality induced by an embedding structure of determinantal point process
Specifically, we clarify the embedding structure of a DPP model in the exponential family of log-linear models (cf. Agresti, 1990; Amari, 2001) in Theorem 1. Models embedded in exponential families are called curved exponential families. Information geometry (Amari, 1985) provides a measure, the e-embedding curvature tensor (Efron, 1975; Reeds, 1975; Amari, 1982; Sei, 2011), to quantify the extent to which a curved exponential family deviates from an exponential family. To evaluate the e-embedding curvature as well as the Fisher information matrix, we apply diagonal scaling (Marshall and Olkin, 1968), also known as the quality vs. diversity decomposition in the DPP literature (Kulesza and Taskar, 2012), to the L-ensemble kernel of a DPP model; this evaluation clarifies that the subset of parameters related to the item-wise effects (quality terms) has zero e-embedding curvature (Corollary 1).
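The diagonal scaling referred to here is the standard quality vs. diversity decomposition of an L-ensemble kernel, L = diag(q) S diag(q), with P(Y) proportional to det(L_Y) and normalizer det(L + I). A small numerical sketch with arbitrary values:

```python
import numpy as np

# Quality vs. diversity decomposition of an L-ensemble DPP kernel
# (Kulesza and Taskar, 2012): L = diag(q) S diag(q), with P(Y) ∝ det(L_Y).
# All numbers below are illustrative.

rng = np.random.default_rng(0)
n = 4
q = rng.uniform(0.5, 2.0, size=n)           # item-wise quality terms
Phi = rng.normal(size=(n, 3))
Phi /= np.linalg.norm(Phi, axis=1, keepdims=True)
S = Phi @ Phi.T                              # similarity (diversity) kernel
L = np.diag(q) @ S @ np.diag(q)              # diagonal scaling of S by q

# Normalizer of an L-ensemble: sum over subsets of det(L_Y) equals det(L + I)
Z = np.linalg.det(L + np.eye(n))
Y = (0, 2)
p_Y = np.linalg.det(L[np.ix_(Y, Y)]) / Z     # probability of observing {0, 2}
print(p_Y)
```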
A Convergence Analysis of Log-Linear Training
Log-linear models are widely used probability models for statistical pattern recognition. Typically, log-linear models are trained according to a convex criterion. In recent years, the interest in log-linear models has greatly increased. The optimization of log-linear model parameters is costly and therefore an important topic, in particular for large-scale applications. Different optimization algorithms have been evaluated empirically in many papers. In this work, we analyze the optimization problem analytically and show that the training of log-linear models can be highly ill-conditioned. We verify our findings on two handwriting tasks. By making use of our convergence analysis, we obtain good results on a large-scale continuous handwriting recognition task with a simple and generic approach.
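The ill-conditioning the abstract refers to can be seen directly in the Hessian of a binary log-linear (logistic) negative log-likelihood, which is Xᵀ diag(p(1-p)) X: nearly collinear features drive its condition number up. A toy illustration with assumed data:

```python
import numpy as np

# Illustration of ill-conditioned log-linear training: the Hessian of the
# binary logistic NLL is Xᵀ diag(p(1-p)) X, and strongly correlated features
# blow up its condition number. Toy data, assumed setup.

rng = np.random.default_rng(0)
n = 1000
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)    # nearly collinear feature
X = np.column_stack([x1, x2, rng.normal(size=n)])

w = np.zeros(3)                        # Hessian at the origin, where p = 0.5
p = 1.0 / (1.0 + np.exp(-X @ w))
H = X.T @ (X * (p * (1 - p))[:, None]) / n

eigs = np.linalg.eigvalsh(H)
print("condition number:", eigs[-1] / eigs[0])   # large => ill-conditioned
```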